10:00
Slides: bit.ly/2024-nicd-mlops
Access the virtual environment:
URL: URL
Master password: PASSWORD
Background in Astrophysics.
Data Scientist @ Jumping Rivers:
Python & R support for various clients.
Teach courses in Python, R, SQL, Machine Learning.
Hobbies include hiking and travelling.
↗ jumpingrivers.com 𝕏 @jumping_uk
The typical data science workflow:
MLOps: Machine Learning Operations
File formats
Tidying & cleaning
Versioning
Take advantage of native MLOps tooling
Using {tidyr} and the {palmerpenguins} dataset:
Add a step for data splitting.
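A minimal sketch of the tidying and splitting steps (the 75/25 ratio and use of {rsample} are assumptions, not from the slides):

```r
library(tidyr)
library(palmerpenguins)
library(rsample)

# Drop rows with missing values in the columns we will model on
penguins_data <- penguins |>
  drop_na(species, island, flipper_length_mm, body_mass_g)

# Split the cleaned data into training and test sets, stratified by species
penguins_split <- initial_split(penguins_data, prop = 0.75, strata = species)
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)
```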
Open task1.txt
Give attendees the same task as the code above (stored in demo1.txt), but with a different dataset.
Include a bonus question for R experts (like an extra step for feature selection).
Non-R experts should just try running the solution script and ask Myles questions.
Not an R user? The solution can be found in task1_solutions.R
You have just built a data validation pipeline!
10:00
Choosing the right model can be tough!
Versioning
We’ll consider a basic nearest neighbour model for this workshop.
Let’s predict penguin species using island, flipper_length_mm and body_mass_g
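One way to specify and fit that nearest neighbour model with {parsnip} (a sketch; the {kknn} engine choice is an assumption consistent with the predictors above):

```r
library(parsnip)

# Specify a k-nearest-neighbour classifier using the {kknn} engine
model_spec <- nearest_neighbor(mode = "classification") |>
  set_engine("kknn")

# Fit the model to predict species from the three predictors
model <- model_spec |>
  fit(species ~ island + flipper_length_mm + body_mass_g,
      data = penguins_data)
```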
Our model object can now be used to predict species:
model_pred = predict(model, penguins_data)
mean(
  model_pred$.pred_class == as.character(
    penguins_data$species
  )
)
v_model is a list with six elements.
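The v_model object can be created by wrapping the fitted model with {vetiver}; the model name here is an assumption:

```r
library(vetiver)

# Bundle the fitted model with the metadata vetiver needs for deployment
v_model <- vetiver_model(model, model_name = "penguins_knn")
```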
Open task2.txt
Run your solution to task 1 to prepare the data
Reproduce the same modelling code as above, but with a different model and the same dataset as in Task 1.
Again, provide a solution script for non-R users, and a bonus question for experts.
10:00
Try deploying locally to check that your model API works as expected.
Use environment managers like {renv} to store model dependencies.
Use containers like Docker to bundle model source code with dependencies.
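For example, {renv} can capture the exact package versions the model depends on (a sketch):

```r
library(renv)

renv::init()      # set up a project-local package library
renv::snapshot()  # record package versions in renv.lock
renv::restore()   # reinstall those exact versions in another environment
```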
We deploy models as APIs which take input data and send back model predictions.
We can use a {plumber} API to deploy a {vetiver} model.
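A local deployment might look like this (the port is an assumption):

```r
library(plumber)
library(vetiver)

# Serve the vetiver model as a plumber API on localhost
pr() |>
  vetiver_api(v_model) |>
  pr_run(port = 8080)
```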
Check the deployment with:
Let’s check that our API works!
The metadata and predict endpoints allow programmatic queries:
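Querying the running API from R (the URL and port are assumptions matching a local deployment):

```r
library(vetiver)

# Point at the API's /predict endpoint and request predictions
endpoint <- vetiver_endpoint("http://127.0.0.1:8080/predict")
predict(endpoint, penguins_test)
```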
Our model gives us predictions!
Open task3.txt
Repeat local deployment and prediction for model created in Task 2.
Provide solution for non-R people.
Make sure demo3.R contains code from demo above.
Also check that local deployment / prediction works in Posit Workbench.
10:00
Our Dockerfile contains a series of commands to:
Set the R version and install the system libraries.
Install the required R packages.
Run the API in the deployed environment.
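A Dockerfile following the steps above might look like this (all paths, versions, and system libraries here are assumptions):

```dockerfile
# Set the R version
FROM rocker/r-ver:4.3.1

# Install the system libraries the R packages need
RUN apt-get update && apt-get install -y --no-install-recommends \
    libcurl4-openssl-dev libssl-dev && \
    rm -rf /var/lib/apt/lists/*

# Install the required R packages
RUN Rscript -e 'install.packages(c("plumber", "vetiver"))'

# Copy the API script and run it in the deployed environment
COPY plumber.R /app/plumber.R
EXPOSE 8080
CMD ["Rscript", "-e", "plumber::pr_run(plumber::pr('/app/plumber.R'), host = '0.0.0.0', port = 8080)"]
```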
AWS offers a two-month free trial of Amazon SageMaker.
Azure Machine Learning is offered at no extra charge to existing Azure customers.
Costs can rise depending on computational resources consumed.
Model building and deployment use different environments.
Deployment is just the beginning…
Our model may perform well with current data.
As the data and user base grow, your model must scale.
The underlying distributions of the data can and will change:
The intrinsic relationship between the target variable and predictors can change:
As the data changes, our model predictions start to drift.
Identifying model drift is vital to any MLOps workflow.
Retrain the model with the latest data and redeploy.
How might we identify model drift?
Discuss
05:00
Some best practices:
As users query the model API, store the model predictions.
A shift in the distribution of the model predictions is a classic sign of drift.
As our data grows, run checks of the underlying distributions.
When drift is detected, retrain with the latest data and redeploy.
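The best practices above can be sketched as a simple drift check. Assuming stored predicted classes from two time windows (the vectors below are illustrative), a chi-squared test compares their class proportions:

```r
# Predicted classes stored from two time windows (illustrative data)
old_preds <- c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Adelie", "Gentoo")
new_preds <- c("Gentoo", "Gentoo", "Gentoo", "Adelie", "Gentoo", "Chinstrap")

levels <- c("Adelie", "Chinstrap", "Gentoo")
counts <- rbind(
  old = table(factor(old_preds, levels = levels)),
  new = table(factor(new_preds, levels = levels))
)

# A small p-value suggests the prediction distribution has shifted
chisq.test(counts)
```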
Open task4.txt
The lemurs_new.csv file contains the latest version of the data.
Run your predictions on this new data.
10:00
Retraining and redeployment can happen at the click of a button.
Encourages good practices like model versioning and packaging of source code.
Reduces human error.
Real-time data
Sensor data can provide near-instant, real time measurements.
Digital twins already used in manufacturing, agriculture, and more!